
Fix CI timeouts 🤞#4458

Merged
connorjward merged 4 commits into release from connorjward/increase-timeouts
Jul 24, 2025

Conversation

@connorjward
Contributor

@connorjward connorjward commented Jul 23, 2025

After some exhausting debugging I think I identified the source of the latest hangs.

The issue was introduced in #4391. It turns out that `mpiexec -n 1 pytest ...` can hang, whereas `mpiexec -n 2 pytest ...` doesn't! To find this I had to SSH into the runner and attach gdb to the hanging process, which was spinning in `ompi_finalize`. The fix is therefore just to call `pytest` directly instead of `mpiexec -n 1 pytest`.

I think that this took so long to find because (a) the error is stochastic, and (b) we have also been getting timeouts due to thermal throttling/oversubscription. I've therefore also increased the timeouts for some steps so that slowdowns aren't mistaken for hangs.
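In workflow terms the change is small. A hedged sketch of what the affected step might look like (the file path, step names, test paths, and timeout values here are illustrative assumptions, not the actual Firedrake workflow):

```yaml
# .github/workflows/ci.yml -- illustrative fragment only
jobs:
  test:
    steps:
      - name: Run serial tests
        # Was: mpiexec -n 1 pytest tests/ -v
        # Invoking pytest directly avoids the stochastic hang in ompi_finalize.
        run: pytest tests/ -v
        timeout-minutes: 60    # increased so throttling slowdowns don't look like hangs

      - name: Run parallel tests
        # mpiexec is still required for genuinely parallel runs.
        run: mpiexec -n 2 pytest tests/ -v -m parallel
        timeout-minutes: 120
```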

Hail Mary to try and get things passing.
This was causing a hang during MPI_Finalize.
@connorjward connorjward changed the title Double all timeouts in CI Fix CI timeouts Jul 24, 2025
@connorjward connorjward changed the title Fix CI timeouts Fix CI timeouts 🤞 Jul 24, 2025
@connorjward connorjward requested a review from JHopeCollins July 24, 2025 08:59
@JHopeCollins
Member

JHopeCollins commented Jul 24, 2025

That's weird. Is it really only happening for serial jobs, or are those just the ones where mpiexec isn't actually necessary, so we can avoid using it? There are still timeouts for the 2-processor tests in the most recent checks.

@connorjward
Contributor Author

> That's weird. Is it really only happening for serial jobs, or are those just the ones where mpiexec isn't actually necessary, so we can avoid using it? There are still timeouts for the 2-processor tests in the most recent checks.

Yeah it genuinely is only for serial jobs. The timeouts for n=2 are a legit deadlock with a helpful traceback and everything.

The worst failure mode I observed was the TSFC tests (which are serial and take ~3 minutes) timing out after 2 hours even though the logs said that everything had succeeded. The hang was in the teardown.

Member

@JHopeCollins JHopeCollins left a comment


Ok. I still think it's bizarre that this is only happening with the serial tests, but seeing as the only failing tests are the hang on nprocs=2 and the stochastic Stokes convergence failure we've been seeing, let's get this merged and see if it helps.

@connorjward connorjward merged commit ec176d3 into release Jul 24, 2025
9 of 28 checks passed
@connorjward connorjward deleted the connorjward/increase-timeouts branch July 24, 2025 13:47
pbrubeck pushed a commit that referenced this pull request Jul 30, 2025
* Do not use 'mpiexec -n 1' for serial tests

This was causing a hang during MPI_Finalize.

* modify timeouts